Detecting, correcting, and preventing batch effects in multi-site data, with a focus on gene expression microarrays
Abstract
Gene expression microarrays are widely used to better understand the complex biological mechanisms inside cells. One of the main obstacles to applying statistical learning algorithms to microarray data is the large gap between the number of features (p) and the number of available instances (n), i.e., the “large p, small n” challenge. This thesis explores two ways to deal with this challenge. One approach is to increase n by combining suitably similar microarray data sets. This is appealing as there are now many publicly available microarray studies. The main problem with this approach is the batch effect, i.e., the influence of non-biological factors on expression intensities, which can confound the biological signal. As a result, combining gene expression studies without correcting for batch effects may lead to misleading findings. This thesis proposes a novel batch correction algorithm, called batch effect correction using canonical correlation analysis (BECCA), which assumes that the batch effect is due to additive, independent confounding factors and therefore uses canonical correlation analysis to separate technical bias from the measured biological signal. We compare BECCA to various existing batch correction algorithms on several real-world gene expression studies and find that it performs comparably. The key advantage of BECCA over other similarly performing algorithms is its flexibility: by changing the values of its parameters, the user can adjust how much common signal to preserve across the batches and how much batch-related signal to remove from each one.
The second approach to batch correction considers reducing p by selecting a subset of genes. Our experiments suggest that some genes in microarray data sets contain very little biological signal, i.e., including only these genes in the calculations makes all specimens highly correlated, regardless of their tissue of origin or disease state. It is, therefore, desirable to identify and remove these misleading genes before conducting downstream analysis or batch correction. For this purpose, we propose an efficient algorithm that extends the single-study variance-based gene selection method to multiple studies. Our empirical results show that this feature selection algorithm outperforms other algorithms in reducing the destructive influence of batch effects.
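To make the idea of multi-study variance-based gene selection concrete, the following is a minimal sketch and not the algorithm proposed in the thesis, whose details the abstract does not give. It assumes each study is a samples-by-genes matrix with a shared gene order, ranks genes by their variance within each study, and keeps the genes with the highest average rank. The function name `select_genes_by_variance`, the rank-averaging rule, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def select_genes_by_variance(studies, k):
    """Return sorted indices of the k genes with the highest mean variance rank.

    studies: list of arrays, each of shape (n_samples_i, n_genes), same gene order.
    The rank-averaging rule is an illustrative assumption, not the thesis method.
    """
    n_genes = studies[0].shape[1]
    rank_sum = np.zeros(n_genes)
    for X in studies:
        var = X.var(axis=0, ddof=1)              # per-gene variance within one study
        rank_sum += np.argsort(np.argsort(var))  # variance rank, 0 = lowest
    mean_rank = rank_sum / len(studies)
    return np.sort(np.argsort(mean_rank)[-k:])   # k genes with highest mean rank

# Synthetic example (hypothetical data): genes 0-4 vary strongly, the rest barely.
rng = np.random.default_rng(0)

def make_study(n_samples, batch_offset):
    X = rng.normal(0.0, 0.1, size=(n_samples, 50))
    X[:, :5] = rng.normal(0.0, 3.0, size=(n_samples, 5))
    return X + batch_offset                      # additive shift mimics a batch effect

studies = [make_study(20, 0.0), make_study(20, 5.0)]
selected = select_genes_by_variance(studies, k=5)
print(selected)
```

Computing the variance inside each study, rather than on the pooled data, means that an additive shift between batches does not inflate a gene's variance, which is one plausible reason to rank per study before combining.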
Similar resources
Effects of Over-Expression of LOC92912 Gene on Cell Cycle Progression
Background: We had previously identified the genes involved in squamous cell carcinoma of the head and neck using differential display and DNA microarray techniques. We also reported the first analytical study on a novel human gene called LOC92912, which was identified by differential display as a gene up-regulated in such carcinomas. LOC92912, which is a putative member of the E2 ubiquitin con...
O-30: Comparing Expression Patterns of Endometrial Genes in Implantation Failures and Recurrent Miscarriages with Fertile Couples Following ICSI/IVF Using in Silico Analysis
Background: To screen and diagnose patients with recurrent abortions and implantation failure after IVF/ICSI, differentially expressed genes of endometrium through DNA microarrays were monitored. Materials and Methods: Microarray expression profile of GSE26787 dataset from GEO database was used to analyze gene expression profiles of 15 endometrial biopsy samples- five from control fertile (CF) ...
Evaluation of the Effects of Iron Oxide Nanoparticles on Expression of TEM Type Beta-Lactamase Genes in Pseudomonas Aeruginosa
Pseudomonas aeruginosa is a common cause of surgical-site infections and healthcare-associated infections in the bloodstream and urinary tract. Iron oxide nanoparticles (IONPs) have been shown to possess antibacterial features. The nanoparticles' status as emerging therapeutic elements has motivated investigators to assess the effects of iron nanoparticles on the expression of TEM type be...
Association of morphine-induced analgesic tolerance with changes in gene expression of GluN1 and MOR1 in rat spinal cord and midbrain
Objective(s): We aimed to examine the association of MOR1 and GluN1 gene expression at the mRNA level in the lumbosacral cord and midbrain with morphine tolerance in male Wistar rats. Materials and Methods: Analgesic effects of morphine administered intraperitoneally at doses of 0.1, 1, 5 and 10 mg/kg were examined using a hot plate test in rats with and without a history of 15 days morphine (10 mg...
Microarray-Based RNA Profiling of Breast Cancer: Batch Effect Removal Improves Cross-Platform Consistency
Microarray is a powerful technique used extensively for gene expression analysis. Different technologies are available, but lack of standardization makes it challenging to compare and integrate data. Furthermore, batch-related biases within datasets are common but often not tackled. We have analyzed the same 234 breast cancers on two different microarray platforms. One dataset contained known b...